Introduction to the X10 Implementation of NPB MG

Author

  • Tong Wen

Abstract

Abstract performance metrics
  FP per clock                  4
  Latency                       375 cycles
  Bandwidth                     5.3 bytes/cycle
  Computation cost (Class S)    3,272,372 cycles

Ideal speedup by number of places
  Case                          1        2       4       8
  Bulk-synchronous              1        1.74    3.17    5.46
  Overlap comm & comp           1.19     2.22    4.08    7.05
  Overlap comm                  1.19     2.35    4.51    8.33
  Replace for with foreach      5382.2   27.6    32.8    37.3

Figure 4: The ideal speedups are computed for four cases. For each case, the benchmark (Class S) is run on different numbers of places. The abstract performance metrics are defined in the top table. The computation and communication costs are inserted into the code manually, and the runtime computes the critical path length. The base case is the bulk-synchronous (SPMD-style) implementation; parallelism is then added gradually. First, communication and computation are overlapped in each stencil operation. Second, communication is overlapped in the procedure that updates ghost values. Finally, each unordered for loop in every stencil operation is replaced with a foreach loop.

Abstract performance metrics
  FP per clock                  4
  Latency                       37.5 cycles
  Bandwidth                     5.3 bytes/cycle
  Computation cost (Class S)    3,272,372 cycles

Ideal speedup by number of places
  Case                          1        2       4       8
  Bulk-synchronous              1        1.85    3.58    6.83
  Overlap comm & comp           1.19     2.38    4.67    9.07
  Overlap comm                  1.19     2.43    4.83    9.55
  Replace for with foreach      5382.2   53.0    77.3    112.3

Figure 5: The setting is the same as in Figure 4, except that the communication latency is reduced by a factor of 10. We can see that communication latency is the performance bottleneck of this kind of application.

In the last case shown in Figure 4, the ideal speedup drops sharply, from 5382.2 on 1 place to 27.6 on 2 places. The cause is the high communication-to-computation ratio; that is, the cost of partitioning the computation across computing nodes (places) is relatively high. In Figure 5, where the communication latency is reduced by a factor of 10, the speedup improves in every multi-place case (in the foreach case on 8 places, for example, from 37.3 to 112.3). This shows that communication latency is the performance bottleneck of this kind of application.
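To make the cost model behind Figures 4 and 5 more concrete, here is a minimal Python sketch of how an ideal speedup can be obtained from a critical path. It rests on assumptions not stated in the source: a message costs LATENCY + bytes/BANDWIDTH cycles, a computation costs flops/FP-per-clock cycles, and ideal speedup is the one-place computation cost divided by the critical-path length of the task graph. The two-place stencil step, the 10% boundary fraction, the flop and byte counts, and every function name (`comp_cost`, `comm_cost`, `critical_path`, `stencil_step`, `ideal_speedup`) are hypothetical and are not taken from the X10 code.

```python
"""Toy critical-path cost model for one stencil step (hypothetical sketch)."""
from typing import Dict, List, Tuple

# Abstract machine parameters from the Figure 4 table.
FP_PER_CLOCK = 4       # floating-point operations per cycle
LATENCY = 375.0        # cycles per message
BANDWIDTH = 5.3        # bytes per cycle


def comp_cost(flops: float) -> float:
    """Cycles needed to execute `flops` floating-point operations."""
    return flops / FP_PER_CLOCK


def comm_cost(nbytes: float, latency: float = LATENCY) -> float:
    """Assumed linear message cost: fixed latency plus transfer time."""
    return latency + nbytes / BANDWIDTH


def critical_path(costs: Dict[str, float], deps: Dict[str, List[str]]) -> float:
    """Length of the longest cost-weighted path through a task DAG."""
    finish: Dict[str, float] = {}

    def visit(task: str) -> float:
        # Memoized: a task finishes after its slowest predecessor plus its own cost.
        if task not in finish:
            start = max((visit(d) for d in deps.get(task, [])), default=0.0)
            finish[task] = start + costs[task]
        return finish[task]

    return max(visit(t) for t in costs)


def stencil_step(places: int, total_flops: float, ghost_bytes: float,
                 overlap: bool, latency: float = LATENCY
                 ) -> Tuple[Dict[str, float], Dict[str, List[str]]]:
    """Build the task DAG of one stencil step split evenly across places."""
    costs: Dict[str, float] = {}
    deps: Dict[str, List[str]] = {}
    work = comp_cost(total_flops / places)
    for p in range(places):
        ghosts: List[str] = []
        if places > 1:  # assumption: a single place needs no remote ghost exchange
            costs[f"exchange{p}"] = comm_cost(ghost_bytes, latency)
            ghosts = [f"exchange{p}"]
        if overlap:
            # Interior points need no ghosts, so they run while the exchange is in
            # flight; the boundary (assumed 10% of the work) waits for the ghosts
            # and for the interior, since both run on the same place.
            costs[f"interior{p}"] = 0.9 * work
            costs[f"boundary{p}"] = 0.1 * work
            deps[f"boundary{p}"] = ghosts + [f"interior{p}"]
        else:
            # Bulk-synchronous: compute only after this place's ghosts have arrived.
            costs[f"compute{p}"] = work
            deps[f"compute{p}"] = ghosts
    return costs, deps


def ideal_speedup(places: int, overlap: bool, latency: float = LATENCY,
                  total_flops: float = 80_000, ghost_bytes: float = 1_060) -> float:
    """Computation cost on one place divided by the critical-path length."""
    costs, deps = stencil_step(places, total_flops, ghost_bytes, overlap, latency)
    return comp_cost(total_flops) / critical_path(costs, deps)


if __name__ == "__main__":
    for lat in (LATENCY, LATENCY / 10):
        print(f"latency={lat}: bulk={ideal_speedup(2, False, lat):.3f}, "
              f"overlap={ideal_speedup(2, True, lat):.3f}")
```

With these toy values, two places give an ideal speedup of about 1.891 for the bulk-synchronous step and 2.000 for the overlapped step. Cutting the latency to 37.5 cycles raises the bulk-synchronous value to about 1.954, while the overlapped value is unchanged because the exchange is already hidden behind the interior update. The numbers only illustrate the trend and are not the paper's measurements.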

Publication date: 2006